This notebook provides an introduction to fitting Support Vector Machine (SVM) classifiers. It first reviews logistic regression to highlight the connections between SVMs and logistic regression. It also reviews important preprocessing operations, via Pipelines, for generating categorical and continuous features. It then discusses the major arguments to the sklearn functions and illustrates tuning the major parameters via cross-validation. Preprocessing and cross-validation are managed with Pipelines.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Let's use the CMPINF 2120 binary classification dataset from Week 08. This allows us to compare the SVM performance to simple generalized linear models that we focused on earlier in the semester. This example also includes just 3 inputs which makes it easy to visualize predictions from the model. The data are loaded below.
df = pd.read_csv('../week_08/cmpinf_2120_binary_classification.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      300 non-null    float64
 1   x2      300 non-null    float64
 2   x3      300 non-null    object 
 3   y       300 non-null    int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 9.5+ KB
As shown above there are 3 inputs: x1, x2, and x3. Two of the inputs are continuous and the third input, x3, is categorical. The output y is an integer. The .nunique() method reveals x3 has 3 unique values and the output y has 2 unique values.
df.nunique()
x1    300
x2    300
x3      3
y       2
dtype: int64
The output y is binary but has been encoded as y=1 for the EVENT and y=0 for the NON-EVENT. The .value_counts() method below shows the output is roughly balanced.
df.y.value_counts(normalize=True)
0    0.553333
1    0.446667
Name: y, dtype: float64
This notebook will not explore the data. The notebooks from Week 04 include all necessary figures for exploring this data set. Please see Week 04 if you need review and practice with exploring a classification data set.
We introduced binary classification with logistic regression. We saw earlier in the semester that logistic regression suffers from linear separability issues. If the observed EVENTS can be separated from the observed NON-EVENTS linearly relative to the features, the logistic regression model cannot be fit by maximizing the likelihood. The maximum likelihood estimates (MLEs) do not exist because there are infinitely many coefficient combinations that linearly separate the classes.
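To see why the MLEs do not exist under separation, consider a minimal hypothetical sketch (not the course data set): one input, perfectly separated classes, and a zero-intercept logistic model with slope b. The log-likelihood keeps improving as b grows, so no finite maximizer exists.

```python
import numpy as np

# Hypothetical 1D data (not the course data set): all NON-EVENTS have
# x < 0 and all EVENTS have x > 0, so the classes are perfectly separable.
x = np.array([-2.0, -1.5, -0.5, 0.5, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def bernoulli_log_lik(b, x, y):
    """Log-likelihood of a zero-intercept logistic model with slope b."""
    z = b * x
    # numerically stable: log(sigmoid(z)) = -log(1 + exp(-z))
    return np.sum(-np.logaddexp(0.0, np.where(y == 1, -z, z)))

# The log-likelihood keeps improving as the slope grows, approaching but
# never reaching its supremum of 0, so no finite MLE exists.
for b in [1, 10, 100]:
    print(b, bernoulli_log_lik(b, x, y))
```

The likelihood can always be improved by making the slope steeper, which is exactly the "infinitely many separating combinations" problem described above.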
We saw this very thing in lecture using this simple data set. We created a smaller version which suffered from linear separation. Let's recreate that small data set again. The cell below slices the first 20 rows and assigns the result to the df_small object.
df_small = df.iloc[:20, :].copy()
df_small.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      20 non-null     float64
 1   x2      20 non-null     float64
 2   x3      20 non-null     object 
 3   y       20 non-null     int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 768.0+ bytes
The x3 variable only has a single value in the smaller data set. We can thus focus on just the two continuous inputs.
df_small.nunique()
x1    20
x2    20
x3     1
y      2
dtype: int64
The figure below shows a scatter plot between x2 and x1. The output category is shown by the marker color (hue semantic variable). Do you see the white space between the orange and blue dots in the figure below? We can separate the EVENTS from the NON-EVENTS by drawing circles in between the orange and blue dots!
sns.relplot(data = df_small, x='x1', y='x2', hue='y', s=250)
plt.show()
The fact that we can separate all observed EVENTS from the observed NON-EVENTS means logistic regression will break down! However, that statement is more nuanced than it first appears. Adding a vertical or horizontal reference line to the above plot will not separate the observed classes! Two example decision boundaries are included below. One is a vertical line located at x1=0 and the other is a horizontal line located at x2=0. Neither line perfectly separates y=1 from y=0.
fig, ax = plt.subplots()
sns.scatterplot(data = df_small, x='x1', y='x2', hue='y', s=250, ax=ax)
ax.axvline(0, color = 'red')
ax.axhline(0, color = 'magenta', linestyle='--')
plt.show()
A diagonal line will also not perfectly separate the two classes. The figure below includes 4 example diagonal lines. None of those 4 diagonal lines are capable of separating EVENTS from NON-EVENTS!
fig, ax = plt.subplots()
sns.scatterplot(data = df_small, x='x1', y='x2', hue='y', s=250, ax=ax)
ax.plot([-2, 2], [-2, 2], color='red')
ax.plot([-2, 0.5], [0.2, 2], color='red', linestyle=':')
ax.plot([-2, 2], [2, -2], color='magenta', linestyle='--')
ax.plot([-2, 0.5], [0.5, -2], color='magenta', linestyle=':')
plt.show()
This means that if we fit the "classic" logistic regression model with linear additive features, the model will not crash in this case! We fit this model back in Week 05. Let's do that again to confirm the model can be fit. The statsmodels formula interface API is imported below.
import statsmodels.formula.api as smf
The logistic regression model with linear additive features is fit below.
fit_x1x2_add = smf.logit( formula = 'y ~ x1 + x2', data = df_small).fit()
Optimization terminated successfully.
Current function value: 0.668869
Iterations 5
The model fit successfully! The coefficient MLEs are displayed below.
fit_x1x2_add.params
Intercept   -0.215276
x1          -0.435771
x2          -0.052625
dtype: float64
Let's make predictions with this model to confirm the behavior of the predicted event probability. A full-factorial grid between x1 and x2 is created below. The input bounds are each -2.1 to 2.1 to cover the space of the smaller data set.
input_viz_small = pd.DataFrame([ (xa, xb) for xa in np.linspace(-2.1, 2.1, num=101)
for xb in np.linspace(-2.1, 2.1, num=101) ],
columns=['x1', 'x2'])
input_viz_small.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10201 entries, 0 to 10200
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      10201 non-null  float64
 1   x2      10201 non-null  float64
dtypes: float64(2)
memory usage: 159.5 KB
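As an aside, the same full-factorial grid could equivalently be built with np.meshgrid instead of a list comprehension. This is a hypothetical sketch (the grid_df name is not used elsewhere in this notebook):

```python
import numpy as np
import pandas as pd

# Equivalent full-factorial grid built with np.meshgrid.
# indexing='ij' makes x1 vary slowest, matching the list comprehension
# where xa is the outer loop.
g1, g2 = np.meshgrid(np.linspace(-2.1, 2.1, num=101),
                     np.linspace(-2.1, 2.1, num=101),
                     indexing='ij')

grid_df = pd.DataFrame({'x1': g1.ravel(), 'x2': g2.ravel()})
print(grid_df.shape)
```

Both approaches yield 101 × 101 = 10201 rows in the same order.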
The predictions are made and visualized below. The figure shows the event probability decreases as x1 increases which is consistent with the sign of the slope multiplying x1 as shown previously.
input_viz_small_copy = input_viz_small.copy()
input_viz_small_copy['pred_prob_add'] = fit_x1x2_add.predict( input_viz_small )
sns.relplot( data = input_viz_small_copy, x='x1', y='x2', hue='pred_prob_add',
kind='scatter', palette='vlag' )
plt.show()
The graphic above is slightly challenging to interpret because of the default Seaborn diverging color palette. Because of that, let's create a new column which bins the predicted probability into two intervals split at 0.5.
input_viz_small_copy['pred_prob_add_bin'] = pd.cut( input_viz_small_copy.pred_prob_add,
bins = [0, 0.5, 1],
include_lowest=True )
input_viz_small_copy.pred_prob_add_bin.value_counts()
(-0.001, 0.5]    6288
(0.5, 1.0]       3913
Name: pred_prob_add_bin, dtype: int64
We can now color by the binarized predicted probability values. The blue dots are all associated with predicted probability values less than 50%.
sns.relplot( data = input_viz_small_copy, x='x1', y='x2', hue='pred_prob_add_bin',
kind='scatter', s=7.5 )
plt.show()
Let's add in the training set and denote the observed output class via the style semantic variable. As shown below the linear additive features do not separate all events from non-events. The logistic regression model with linear additive features creates linear decision boundaries.
fig, ax = plt.subplots()
sns.scatterplot(data = input_viz_small_copy, x='x1', y='x2', hue='pred_prob_add_bin', s=7.5, ax=ax)
sns.scatterplot( data = df_small, x='x1', y='x2', style='y', color='black', s=200, ax=ax)
plt.show()
But, we know we can derive non-linear features from the inputs! We know we can include interactions between inputs and polynomials from the inputs! Thus, we can allow the model to "draw a circle" around the events! However, there are many possible circles that separate the EVENTS from the NON-EVENTS. One such circle is shown below.
fig, ax = plt.subplots()
ax.add_patch( plt.Circle((-0.1, 0), radius=1.05, color='grey'))
sns.scatterplot(data = df_small, x='x1', y='x2', hue='y', s=250, ax=ax)
plt.show()
Here's another circle that separates the EVENTS from NON-EVENTS!
fig, ax = plt.subplots()
ax.add_patch( plt.Circle((-0.5, 0), radius=1.25, color='grey'))
sns.scatterplot(data = df_small, x='x1', y='x2', hue='y', s=250, ax=ax)
plt.show()
Here's one more!
fig, ax = plt.subplots()
ax.add_patch( plt.Circle((-0.2, 0.05), radius=0.98, color='grey'))
sns.scatterplot(data = df_small, x='x1', y='x2', hue='y', s=250, ax=ax)
plt.show()
Each of the previous 3 circles completely surrounded the EVENTS and separated the EVENTS from the NON-EVENTS. However, as we have seen, there are many circles that can accomplish this. Quadratic features derived from the two continuous inputs are capable of drawing the previously shown circles. Thus, if we include such features there are infinitely many circles that can be created. Fitting the logistic regression model by maximizing the likelihood will therefore cause the model to crash! We saw this earlier in the semester, but let's try it again. The cell below includes main effects and interactions between the two continuous inputs as well as quadratic features derived from those inputs. As shown below, fitting the logistic regression model results in a crash! The last line of the error message states Perfect separation detected, results not available. The error is therefore stating the model cannot be fit because the EVENTS are separated from the NON-EVENTS!
smf.logit(formula = 'y ~ x1 * x2 + np.power(x1, 2) + np.power(x2, 2)', data = df_small).fit()
---------------------------------------------------------------------------
PerfectSeparationError                    Traceback (most recent call last)
...
PerfectSeparationError: Perfect separation detected, results not available
Mathematically, the function crashes because the coefficients become extremely large. Extreme regression coefficients cause the model to break down.
How do we overcome the linear separation issue? We discussed that regularizing the coefficients and thus preventing extreme coefficient values will allow fitting the logistic regression model. Let's apply the RIDGE penalty to logistic regression and fit the model again. We will use Pipelines and the sklearn logistic regression function instead of statsmodels to execute the penalized logistic regression model.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
Let's create the features via the Polynomial Features preprocessing function.
from sklearn.preprocessing import PolynomialFeatures
The ridge penalized regression model is initialized below. It uses a large value for the C parameter to represent a small regularization strength. This allows the data to dominate the coefficient estimates. The intercept is specified to be fit following the conventional sklearn function behavior.
ridge_high_c = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=25001, C=10000, fit_intercept=True)
The Pipeline below generates the polynomials as the first step and then applies the ridge model. The include_bias argument is set to False within PolynomialFeatures() because the fit_intercept argument was set to True in LogisticRegression().
poly_ridge_wflow = Pipeline(steps=[('make_poly', PolynomialFeatures(degree=2, include_bias=False)),
('ridge', ridge_high_c)])
Let's separate the inputs from the output and convert both to NumPy arrays.
X_small = df_small.loc[:, ['x1', 'x2'] ].copy().to_numpy()
y_small = df_small.y.to_numpy().ravel()
The input and output array shapes are shown below. The output is a 1D array.
X_small.shape
(20, 2)
y_small.shape
(20,)
Let's fit the model by calling the .fit() method on the Pipeline object.
poly_ridge_fit = poly_ridge_wflow.fit( X_small, y_small )
The .steps attribute of the fitted object shows the steps in the Pipeline.
poly_ridge_fit.steps
[('make_poly', PolynomialFeatures(include_bias=False)),
('ridge', LogisticRegression(C=10000, max_iter=25001))]
We can access the objects associated with each step using the .named_steps[] attribute. The Ridge penalized logistic regression model is accessed as .named_steps['ridge'] because we named the Ridge model 'ridge' within the Pipeline. We can therefore access all attributes associated with the LogisticRegression() fit model!
We provided a NumPy array with 2 columns as the input array. However, the Pipeline created the main effects, interactions, and quadratic features for us! That means in addition to the intercept:
poly_ridge_fit.named_steps['ridge'].intercept_
array([13.90332902])
There are 5 coefficients that must be estimated! The model has more features than inputs because of PolynomialFeatures()!
poly_ridge_fit.named_steps['ridge'].coef_
array([[ -5.36740417, 1.6212996 , -8.95851504, -0.5115837 ,
-10.9258002 ]])
The above displays show the coefficients can be relatively large. The intercept and the last coefficient are in the double digits. We used a high C value and thus a weak regularization strength. If we had instead used an unpenalized logistic regression model (penalty = 'none') the coefficients would have been even larger! Remember that sklearn does not tell us when logistic regression crashes. It simply provides the last coefficients before the crash! The ridge penalty will not crash because even a very high C value prevents the enormous coefficient estimates.
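The link between C and the coefficient magnitudes can be illustrated on a small synthetic example. The sketch below uses hypothetical, perfectly separable data (not the course data set); the coefficient norm grows as C increases and the regularization weakens:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical perfectly separable data: the label is determined by
# which side of the line x1 + x2 = 0 each point falls on.
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Small C = strong regularization, large C = weak regularization.
norms = []
for C in [0.01, 1.0, 10000.0]:
    fit = LogisticRegression(penalty='l2', C=C, max_iter=5000).fit(X, y)
    norms.append(float(np.linalg.norm(fit.coef_)))
    print(C, norms[-1])
```

Even the weakly regularized fit stays finite, which is exactly why the ridge penalty rescues logistic regression under separation.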
Let's examine the behavior of this Ridge penalized logistic regression model via predictions. We do not need to create all the features though! The Pipeline will make the features for us. We only need to create the inputs and the Pipeline does the rest of the work for us! The Pipeline therefore enables workflows similar to the formula interface from statsmodels. The formula interface "remembers" the actions necessary for a model. Pipelines do the same thing, but provide us with greater flexibility.
However, since the original training inputs were provided as a NumPy array, let's also make the prediction/visualization grid inputs a NumPy array.
X_viz_small = input_viz_small.loc[:, ['x1', 'x2'] ].copy().to_numpy()
X_viz_small.shape
(10201, 2)
The probability predictions are returned via the .predict_proba() method. The first few rows of the returned object are provided below. There are two columns, one for each class or unique value of the categorical output.
poly_ridge_fit.predict_proba( X_viz_small )[:6, :]
array([[1.00000000e+00, 2.46400257e-30],
[1.00000000e+00, 1.85977647e-29],
[1.00000000e+00, 1.35064100e-28],
[1.00000000e+00, 9.43797232e-28],
[1.00000000e+00, 6.34566376e-27],
[1.00000000e+00, 4.10520643e-26]])
The .classes_ attribute tells us the classes or unique values associated with the model. The y=1 value is the second class as shown below.
poly_ridge_fit.classes_
array([0, 1], dtype=int64)
Thus, we should extract the 1 column from the returned .predict_proba() object. The exact class you should focus on in your applications will depend on the output values associated with your categorical output. In this case, the output variable was encoded as y=1 for the EVENT and y=0 for the NON-EVENT. That's why we can use the 1 column for this example. The 1 column is extracted and assigned to a new column in the input_viz_small_copy dataframe in the cell below.
input_viz_small_copy['ridge_pred_prob'] = poly_ridge_fit.predict_proba( X_viz_small )[:, 1].ravel()
input_viz_small_copy
| x1 | x2 | pred_prob_add | pred_prob_add_bin | ridge_pred_prob | |
|---|---|---|---|---|---|
| 0 | -2.1 | -2.100 | 0.692185 | (0.5, 1.0] | 2.464003e-30 |
| 1 | -2.1 | -2.058 | 0.691714 | (0.5, 1.0] | 1.859776e-29 |
| 2 | -2.1 | -2.016 | 0.691242 | (0.5, 1.0] | 1.350641e-28 |
| 3 | -2.1 | -1.974 | 0.690771 | (0.5, 1.0] | 9.437972e-28 |
| 4 | -2.1 | -1.932 | 0.690298 | (0.5, 1.0] | 6.345664e-27 |
| ... | ... | ... | ... | ... | ... |
| 10196 | 2.1 | 1.932 | 0.225820 | (-0.001, 0.5] | 5.405795e-34 |
| 10197 | 2.1 | 1.974 | 0.225434 | (-0.001, 0.5] | 9.213134e-35 |
| 10198 | 2.1 | 2.016 | 0.225048 | (-0.001, 0.5] | 1.510827e-35 |
| 10199 | 2.1 | 2.058 | 0.224663 | (-0.001, 0.5] | 2.383866e-36 |
| 10200 | 2.1 | 2.100 | 0.224278 | (-0.001, 0.5] | 3.619166e-37 |
10201 rows × 5 columns
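Hard-coding column 1 works here because of how y was encoded, but a pattern that generalizes is to look up the column index from the .classes_ attribute. The sketch below uses a tiny hypothetical fit purely to illustrate the lookup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical fit purely to illustrate the column lookup.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
fit = LogisticRegression().fit(X, y)

# Look up the EVENT (y=1) column from .classes_ instead of hard-coding 1.
event_col = list(fit.classes_).index(1)
event_prob = fit.predict_proba(X)[:, event_col]
print(event_col, event_prob.round(3))
```

The same lookup works for any fitted sklearn classifier, including the SVMs later in this notebook.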
The ridge penalized logistic regression model predictions are visualized in the Seaborn figure below. The diverging color scale shows bright red within the circle and dark blue outside the circle. Red corresponds to high event probability and blue denotes low event probability.
sns.relplot( data = input_viz_small_copy, x='x1', y='x2', hue='ridge_pred_prob',
kind='scatter', palette='vlag', s=7.5 )
plt.show()
Let's cut the predicted probability into bins as we did previously. However, this time let's use 3 bins instead of two. This way we can include a large "buffer" around the conventional decision threshold of 50%. The pd.cut() function below is used to create bins for predicted probability values between 0 and 10%, between 10% and 90%, and 90% to 100%. We are thus identifying predictions close to 0, predictions close to 1, and then everything in between. At first glance this might seem odd to do. However, it will make more sense as to why this was done when we visualize the predictions.
input_viz_small_copy['ridge_pred_prob_bin'] = pd.cut( input_viz_small_copy.ridge_pred_prob,
bins = [0, 0.1, 0.9, 1],
include_lowest=True )
The cut predicted probabilities are visualized below. Even though the "buffer" zone covers 10% to 90% probability the figure has relatively few markers colored orange! The figure is dominated by blue and green markers! The model therefore transitions from predicted very low probabilities near zero to very high probabilities near 1 across a very "narrow" region. That region resembles a "shell" around a circle of very high probability.
sns.relplot( data = input_viz_small_copy, x='x1', y='x2', hue='ridge_pred_prob_bin',
kind='scatter', s=7.5 )
plt.show()
The figure below includes the small training set and uses the marker shape (style semantic variable) to denote the observed class. All observed EVENTS are contained in the green circle associated with very high predicted probability.
fig, ax = plt.subplots()
sns.scatterplot(data = input_viz_small_copy, x='x1', y='x2', hue='ridge_pred_prob_bin', s=7.5, ax=ax)
sns.scatterplot( data = df_small, x='x1', y='x2', style='y', color='black', s=200, ax=ax)
plt.show()
Unpenalized or unregularized logistic regression crashes when the model is trying to create a step change. The RIDGE penalty therefore allows the logistic regression model to create a "fuzzy" or "smoothed out" decision boundary. The probability transitions from low values to high values over a narrow input interval.
Why are we spending so much time on logistic regression if this notebook is dedicated to Support Vector Machines? The previous example illustrated several important steps. First, we generated features from the inputs. Second, those features created linearly separated output classes which caused unpenalized logistic regression to crash. Third, we needed regularization to enable solving the problem. Those three pieces are the essential building blocks of Support Vector Machines! SVMs therefore share a lot in common with penalized logistic regression! SVMs are based on hard margin classifiers. A hard margin classifier tries to find the optimal separating boundary between the EVENTS and NON-EVENTS. Thus, hard margin classifiers overcome the linear separation issue experienced by unpenalized logistic regression. However, as we have seen, the separating boundary may not be a straight line! The separating boundary could be a complex curve. Support Vector Machines use the kernel trick to generate features via kernel functions. That sounds like a complex process, but it essentially means deriving features from the inputs.
The type of features generated by the model depends on the kernel function. There are many kernel functions to choose from. Some are simple and some are complex. A starting point is the polynomial kernel which enables creating the same kind of polynomials we created manually in the example above! Thus, SVMs will generate polynomials "automatically" if we tell the SVM to use the polynomial kernel. Let's create an SVM with a second degree (quadratic) polynomial kernel. The sklearn SVM for classification is imported below.
from sklearn.svm import SVC
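Before using the polynomial kernel, it may help to see what it computes. For the 'poly' kernel, sklearn evaluates (gamma * x·z + coef0) ** degree. The sketch below checks that formula against sklearn's own polynomial_kernel() helper on hypothetical arrays:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 2))  # hypothetical feature arrays
B = rng.normal(size=(3, 2))

gamma, coef0, degree = 0.5, 1.0, 2
# sklearn's polynomial kernel: K(a, b) = (gamma * a.b + coef0) ** degree
K_manual = (gamma * A @ B.T + coef0) ** degree
K_sklearn = polynomial_kernel(A, B, degree=degree, gamma=gamma, coef0=coef0)
print(np.allclose(K_manual, K_sklearn))
```

Expanding the squared dot product produces exactly the main effect, interaction, and quadratic terms we built by hand earlier, which is the sense in which the kernel "derives features from the inputs."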
Let's initialize the SVM object. By default SVMs do not return predicted probabilities. They require extra work to estimate probabilities compared to logistic regression. We therefore need to instruct the SVM to return the probabilities via the probability argument to SVC(). The kernel function is specified by the kernel argument. We also need to specify the gamma and C arguments, but those will be discussed in more detail later. For now, those two arguments are set to the values shown below. The polynomial kernel is created with kernel = 'poly' and the degree argument dictates the degree of the polynomial. The second degree polynomial kernel SVM is initialized below.
svm_poly_hardmargin = SVC(gamma='scale', kernel='poly', degree=2, C=100000, probability=True)
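As an aside on the probability argument: without probability=True, an SVC cannot return predicted probabilities, but it still classifies via its signed decision function. A minimal sketch on hypothetical separable data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable data with two inputs.
X = np.array([[-2.0, 0.0], [-1.0, 0.5], [1.0, -0.5], [2.0, 0.0]])
y = np.array([0, 0, 1, 1])

# With the default probability=False, predict() still works: it uses the
# sign of the decision function (negative -> class 0, positive -> class 1).
svm_plain = SVC(kernel='linear').fit(X, y)
print(svm_plain.decision_function(X))
print(svm_plain.predict(X))
```

Setting probability=True adds an internal calibration step on top of this decision function, which is the "extra work" mentioned above.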
We do not need to generate the features with PolynomialFeatures()! The SVM does that for us! However, let's continue to use Pipelines to be consistent with the ridge penalized logistic regression model we fit previously.
svm_poly_hardmargin_wflow = Pipeline( steps=[('hm', svm_poly_hardmargin )])
The SVM is fit using the small training data below.
svm_poly_hardmargin_fit = svm_poly_hardmargin_wflow.fit( X_small, y_small )
The steps of the fitted object are shown below.
svm_poly_hardmargin_fit.steps
[('hm', SVC(C=100000, degree=2, kernel='poly', probability=True))]
We can access the underlying SVM model with the .named_steps[] attribute.
svm_poly_hardmargin_fit.named_steps['hm']
SVC(C=100000, degree=2, kernel='poly', probability=True)
Let's make predictions on the small visualization grid. The probability predictions are organized the same way as the ridge penalized logistic regression predictions.
svm_poly_hardmargin_fit.predict_proba( X_viz_small )[:6, :]
array([[9.99999844e-01, 1.55986283e-07],
[9.99999771e-01, 2.29078569e-07],
[9.99999666e-01, 3.34354648e-07],
[9.99999515e-01, 4.85014940e-07],
[9.99999301e-01, 6.99242341e-07],
[9.99998998e-01, 1.00190183e-06]])
svm_poly_hardmargin_fit.classes_
array([0, 1], dtype=int64)
The predictions from the SVM are compiled with the other predictions below.
input_viz_small_copy['svm_pred_prob'] = svm_poly_hardmargin_fit.predict_proba( X_viz_small )[:, 1].ravel()
The predicted probabilities associated with the 2nd degree polynomial kernel are shown below. The predictions are not exactly the same as our ridge penalized logistic regression with main effects, interaction, and quadratic features. However, both show a roughly circular (or elliptical, to be more exact) decision boundary!
sns.relplot( data = input_viz_small_copy, x='x1', y='x2', hue='svm_pred_prob',
kind='scatter', palette='vlag', s=7.5 )
plt.show()
All observed EVENTS are contained within the ellipse as shown below.
fig, ax = plt.subplots()
sns.scatterplot(data = input_viz_small_copy, x='x1', y='x2', hue='svm_pred_prob', palette='vlag', s=7.5, ax=ax)
sns.scatterplot( data = df_small, x='x1', y='x2', style='y', color='black', s=200, ax=ax)
plt.show()
The figure below directly compares the ridge penalized logistic regression prediction to the polynomial kernel SVM predictions. Both models have the higher probabilities within the circle/ellipse. The circle/ellipse white boundary corresponds to probabilities near 50%.
sns.relplot(data = input_viz_small_copy.loc[:, ['x1', 'x2', 'ridge_pred_prob', 'svm_pred_prob']].melt(id_vars=['x1', 'x2'], ignore_index=True),
x='x1', y='x2', hue='value', col='variable', palette='vlag', s=7)
plt.show()
Again the surfaces are not identical. There are differences between the way the two models are fit. However, both models share the ideas of generating features and trying to find the separating region between the EVENTS and NON-EVENTS based on those features.
The hard margin classifier is motivated to find the one optimal separating boundary between the EVENTS and NON-EVENTS. However, in practice we often do not have hard, distinct boundaries between output classes. Instead there is often overlap between the EVENTS and NON-EVENTS. The hard margin classifier breaks down and thus does not work in situations where logistic regression works just fine! The hard margin classifier overcomes this limitation by becoming the soft margin classifier, which allows for misclassifications. SVMs are therefore really extensions of the soft margin classifier. The C parameter within SVC() controls the decision boundary behavior by controlling whether the SVM behaves more like a hard margin classifier or a soft margin classifier. We will now consider the influence of C by returning to the larger data set that includes all three inputs. The figure below shows the scatter plot between x2 and x1 faceted by x3 and colored by the binary outcome. As shown below, no clear hard separating boundary exists in the data! There are regions with more EVENTS than NON-EVENTS as well as regions with more NON-EVENTS than EVENTS. But, there is no way to draw a single curve that separates the EVENTS from the NON-EVENTS in all three facets. We therefore need to combine the soft margin classifier idea with the SVM's ability to generate features automatically to create a classifier for this more challenging example.
sns.relplot(data = df, x='x1', y='x2', hue='y', col='x3', s=75)
plt.show()
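The soft margin idea can be previewed on synthetic data before returning to our example. In the hedged sketch below (hypothetical overlapping classes, not the course data set), a large C approximates the hard margin, while a small C softens the margin so that many more points typically end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Hypothetical two-class data with overlap between the classes.
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Large C approximates the hard margin (few support vectors);
# small C softens the margin, so many more points fall inside it.
sv_counts = {}
for C in [100.0, 0.01]:
    fit = SVC(kernel='linear', C=C).fit(X, y)
    sv_counts[C] = len(fit.support_)
    print(C, sv_counts[C])
```

The support vectors are the training points that define the margin, so their count is a quick diagnostic of how hard or soft the fitted margin is.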
However, SVMs cannot work directly with categorical inputs. We thus need to generate dummy variables from the categorical input! SVMs also prefer continuous inputs to be standardized. The continuous inputs in this example are roughly on the same scale. However, let's standardize the continuous inputs for practice here. We thus need to create a composite preprocessing Pipeline which standardizes the continuous inputs and generates the dummy variables from the categorical input.
The StandardScaler(), OneHotEncoder() and ColumnTransformer() functions are imported below.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
The names of the continuous inputs and categorical input are defined below.
names_numeric_inputs = ['x1', 'x2']
names_cat_inputs = ['x3']
The inputs are separated from the entire training dataframe.
x_inputs_df = df.loc[:, names_numeric_inputs + names_cat_inputs].copy()
x_inputs_df.head()
| x1 | x2 | x3 | |
|---|---|---|---|
| 0 | -0.173789 | -0.750486 | A |
| 1 | -0.388312 | -0.361116 | A |
| 2 | -0.730844 | 0.141729 | A |
| 3 | -1.336667 | 1.883643 | A |
| 4 | -0.443531 | 2.134529 | A |
The numeric input Pipeline is defined below.
numeric_transform = Pipeline( steps = [('std_input', StandardScaler())] )
The categorical input Pipeline is defined below.
categorical_transform = Pipeline( steps = [ ('dummy', OneHotEncoder(drop='first') )] )
Compose the two Pipeline transformations together into the composite pipeline.
preprocess_transform = ColumnTransformer( transformers=[ ('num', numeric_transform, names_numeric_inputs),
('cat', categorical_transform, names_cat_inputs) ] )
The .transformers attribute shows us which transformations are applied to which set of inputs.
preprocess_transform.transformers
[('num', Pipeline(steps=[('std_input', StandardScaler())]), ['x1', 'x2']),
('cat', Pipeline(steps=[('dummy', OneHotEncoder(drop='first'))]), ['x3'])]
The transformation can be applied to the training inputs. The two columns at the end correspond to the dummy variables! Two dummy variables are created because there are 3 categories for the categorical input! The rows shown below are all category A, the dropped reference level, which is why both dummy columns equal zero.
preprocess_transform.fit_transform( x_inputs_df )[:6, :]
array([[-0.1679734 , -0.79704121, 0. , 0. ],
[-0.37557672, -0.383967 , 0. , 0. ],
[-0.70706077, 0.14949122, 0. , 0. ],
[-1.29334209, 1.99745207, 0. , 0. ],
[-0.42901543, 2.26361145, 0. , 0. ],
[ 0.2878375 , -1.3216246 , 0. , 0. ]])
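To make the dummy variable behavior concrete, the sketch below applies OneHotEncoder(drop='first') to a tiny hypothetical array containing the three categories. The first (alphabetically sorted) category A becomes the reference level, so only two dummy columns remain:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Tiny hypothetical column with the three categories A, B, and C.
x3_toy = np.array([['A'], ['B'], ['C'], ['A']])

# drop='first' drops the first (alphabetical) level A as the reference,
# so 3 categories become 2 dummy columns.
enc = OneHotEncoder(drop='first')
dummies = enc.fit_transform(x3_toy).toarray()
print(dummies)
print(enc.categories_)
```

Rows with category A map to all zeros, while B and C each switch on their own dummy column.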
C parameter

Let's now define the Pipeline for the second degree polynomial kernel SVM combined with the composite preprocessing operations. We will first use the very high value of C.
svm_poly_highC_wflow = Pipeline( steps = [('prepro', preprocess_transform),
('svm', svm_poly_hardmargin)] )
Fit the SVM with the high value of C.
svm_poly_highC_fit = svm_poly_highC_wflow.fit( x_inputs_df, df.y.to_numpy().ravel() )
Again, we can access the underlying objects with the .named_steps[] attribute.
svm_poly_highC_fit.named_steps['svm']
SVC(C=100000, degree=2, kernel='poly', probability=True)
svm_poly_highC_fit.classes_
array([0, 1], dtype=int64)
Now, let's examine the behavior of the model via predictions. The cell below defines the full factorial grid between all 3 inputs. The bounds of the continuous inputs are widened to -3 and 3 in the cell below. Also, the categorical input is included with all three unique values.
input_viz = pd.DataFrame([ (xa, xb, xc) for xa in np.linspace(-3, 3, num=101)
for xb in np.linspace(-3, 3, num=101)
for xc in df.x3.unique()],
columns=['x1', 'x2', 'x3'])
input_viz
|   | x1 | x2 | x3 |
|---|---|---|---|
| 0 | -3.0 | -3.00 | A |
| 1 | -3.0 | -3.00 | B |
| 2 | -3.0 | -3.00 | C |
| 3 | -3.0 | -2.94 | A |
| 4 | -3.0 | -2.94 | B |
| ... | ... | ... | ... |
| 30598 | 3.0 | 2.94 | B |
| 30599 | 3.0 | 2.94 | C |
| 30600 | 3.0 | 3.00 | A |
| 30601 | 3.0 | 3.00 | B |
| 30602 | 3.0 | 3.00 | C |
30603 rows × 3 columns
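As an aside, the nested comprehension above is equivalent to taking the Cartesian product of the three per-input grids. A self-contained sketch with `itertools.product` (the variable names are illustrative):

```python
import itertools

import numpy as np
import pandas as pd

# per-input grids: 101 points for each continuous input, 3 categorical levels
x1_grid = np.linspace(-3, 3, num=101)
x2_grid = np.linspace(-3, 3, num=101)
x3_levels = ['A', 'B', 'C']

# Cartesian product of the three grids gives the full factorial design
grid_df = pd.DataFrame(list(itertools.product(x1_grid, x2_grid, x3_levels)),
                       columns=['x1', 'x2', 'x3'])
print(grid_df.shape)  # (30603, 3) because 101 * 101 * 3 = 30603
```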
The predicted classifications from the high C SVM are generated below.
input_viz_copy = input_viz.copy()
input_viz_copy['pred_svm_highC_class'] = svm_poly_highC_fit.predict( input_viz )
input_viz_copy
|   | x1 | x2 | x3 | pred_svm_highC_class |
|---|---|---|---|---|
| 0 | -3.0 | -3.00 | A | 0 |
| 1 | -3.0 | -3.00 | B | 0 |
| 2 | -3.0 | -3.00 | C | 0 |
| 3 | -3.0 | -2.94 | A | 0 |
| 4 | -3.0 | -2.94 | B | 0 |
| ... | ... | ... | ... | ... |
| 30598 | 3.0 | 2.94 | B | 0 |
| 30599 | 3.0 | 2.94 | C | 0 |
| 30600 | 3.0 | 3.00 | A | 0 |
| 30601 | 3.0 | 3.00 | B | 0 |
| 30602 | 3.0 | 3.00 | C | 0 |
30603 rows × 4 columns
There are just 2 classes and the pred_svm_highC_class has 2 unique values.
input_viz_copy.pred_svm_highC_class.value_counts()
0 26735 1 3868 Name: pred_svm_highC_class, dtype: int64
The classifications are visualized below. All three facets, corresponding to different categorical input values, show roughly the same classifications! The EVENTS are within a circular shape in the middle!
sns.relplot(data = input_viz_copy, x='x1', y='x2', hue='pred_svm_highC_class', col='x3', s=7)
plt.show()
What if we instead use a LOW C value? The cell below defines a 2nd degree polynomial kernel SVM with a very low C.
svm_poly_lowC = SVC(gamma='scale', kernel='poly', degree=2, C=0.0001, probability=True)
The low C workflow is defined below.
svm_poly_lowC_wflow = Pipeline( steps = [('prepro', preprocess_transform),
('svm', svm_poly_lowC)] )
The model is fit.
svm_poly_lowC_fit = svm_poly_lowC_wflow.fit( x_inputs_df, df.y.to_numpy().ravel() )
Predicted classifications are made on the visualization grid below.
input_viz_copy['pred_svm_lowC_class'] = svm_poly_lowC_fit.predict( input_viz )
The predicted classifications from the very low C value SVM are visualized below.
sns.relplot(data = input_viz_copy, x='x1', y='x2', hue='pred_svm_lowC_class', col='x3', s=7)
plt.show()
The very low C causes the model to never classify the EVENT!!!! The model is "ok" with misclassifications on the training set because low C values are associated with the soft margin classifier! Very low values of C allow many misclassifications, leading to worse training set performance. Very high values of C heavily penalize misclassifications, which leads to high training set performance.
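This soft-margin behavior is visible in the number of support vectors. A small synthetic sketch (linear kernel on made-up, well-separated blobs; not this notebook's data) showing that a very low C turns many more training points into support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# two well-separated blobs: NON-EVENT around (-2, -2), EVENT around (2, 2)
rng = np.random.default_rng(101)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
y = np.repeat([0, 1], 50)

# near-hard-margin fit (very high C) vs soft-margin fit (very low C)
svm_highC = SVC(kernel='linear', C=1e5).fit(X, y)
svm_lowC = SVC(kernel='linear', C=1e-4).fit(X, y)

# the low C model tolerates margin violations, so far more points are support vectors
print(svm_highC.n_support_.sum(), svm_lowC.n_support_.sum())
```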
The training set Accuracy (the default .score() method) is shown below for the two values of C. The training set accuracy is higher for the high value of C compared to the low value of C.
svm_poly_highC_fit.score( x_inputs_df, df.y.to_numpy().ravel() )
0.75
svm_poly_lowC_fit.score( x_inputs_df, df.y.to_numpy().ravel() )
0.5533333333333333
What happens if we try a moderate value of C? For example, the default C value is 1. As shown below, the training set Accuracy associated with the default value of C is relatively close to the very high value of C.
svm_poly_defaultC = SVC(gamma='scale', kernel='poly', degree=2, C=1, probability=True)
svm_poly_defaultC_wflow = Pipeline( steps = [('prepro', preprocess_transform),
('svm', svm_poly_defaultC)] )
svm_poly_defaultC_fit = svm_poly_defaultC_wflow.fit( x_inputs_df, df.y.to_numpy().ravel() )
svm_poly_defaultC_fit.score( x_inputs_df, df.y.to_numpy().ravel() )
0.7366666666666667
What do the predictions look like for the default value of C? As shown below, the default value of C results in model behavior similar to the very high value of C in this case.
input_viz_copy['pred_svm_defaultC_class'] = svm_poly_defaultC_fit.predict( input_viz )
sns.relplot(data = input_viz_copy, x='x1', y='x2', hue='pred_svm_defaultC_class', col='x3', s=7)
plt.show()
Tuning C

Which value of C is better? We do not know from the training set behavior! Remember we do not care about training set performance! We care about performance on new data! Thus, we do not know which value of C is best based on training set behavior alone. The C parameter is the primary tuning parameter associated with SVMs! We must use cross-validation to tune C by maximizing the average performance on hold-out data.
Essentially, you can think of the SVM C parameter just like the C parameter in regularized logistic regression! The C parameter controls the regularization strength. High values of C put more weight on maximizing training set performance while low values of C correspond to driving all coefficients to zero! The SVM therefore shares a lot in common with regularized logistic regression!
Let's now tune the 2nd degree polynomial kernel SVM with grid search. We will specify a candidate set of possible C values ranging from very small to very large. The GridSearchCV and StratifiedKFold classes are imported below. We use stratified cross-validation because we are working with a classification problem.
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
Let's use 5-fold CV for this application.
my_cv = StratifiedKFold(n_splits=5, random_state=101, shuffle=True)
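StratifiedKFold ensures every fold preserves the class proportions of the output. A self-contained sketch with made-up, imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# imbalanced toy labels: 80 NON-EVENTS and 20 EVENTS
y_toy = np.repeat([0, 1], [80, 20])
X_toy = np.zeros((100, 1))  # the features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, random_state=101, shuffle=True)

# each hold-out fold contains 20 points with exactly 4 EVENTS: a 0.2 event rate
fold_event_rates = [y_toy[test_idx].mean() for _, test_idx in skf.split(X_toy, y_toy)]
print(fold_event_rates)
```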
Let's define a workflow which combines the composite preprocessing actions with a 2nd degree polynomial kernel. The C argument is not specified in the initialized SVC() object below.
svm_poly_2_wflow = Pipeline( steps = [('prepro', preprocess_transform),
('svm', SVC(gamma='scale', kernel='poly', degree=2, probability=True))] )
The tuning grid is specified following the naming conventions from Pipelines. The C parameter increases from 0.001 to 1000.
svm_poly_2_tune_grid = {'svm__C': [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]}
The grid search cross-validation object is initialized below.
svm_poly_2_search = GridSearchCV( svm_poly_2_wflow, svm_poly_2_tune_grid, cv=my_cv)
The cross-validation is applied to tune the model below.
svm_poly_2_search_results = svm_poly_2_search.fit(x_inputs_df, df.y.to_numpy().ravel())
The best value for C is the one that maximizes the average hold-out set Accuracy (the default scoring method).
svm_poly_2_search_results.best_params_
{'svm__C': 10}
The best average hold-out set Accuracy is shown below. On average the tuned value of C produces a model that has a hold-out set Accuracy of 74%.
svm_poly_2_search_results.best_score_
0.7433333333333334
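The .best_params_ and .best_score_ attributes only report the winner, but the .cv_results_ attribute stores the average hold-out performance of every candidate. A self-contained sketch on a synthetic problem (the toy data and grid are assumptions, not this notebook's):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# toy classification problem standing in for the notebook's data
X, y = make_classification(n_samples=200, n_features=4, random_state=101)

toy_wflow = Pipeline(steps=[('prepro', StandardScaler()),
                            ('svm', SVC(kernel='poly', degree=2))])
toy_grid = {'svm__C': [0.01, 1.0, 100]}
toy_search = GridSearchCV(toy_wflow, toy_grid,
                          cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=101))
toy_search.fit(X, y)

# one row per candidate C: the average and standard deviation of the fold accuracies
cv_table = pd.DataFrame(toy_search.cv_results_)[['param_svm__C',
                                                 'mean_test_score',
                                                 'std_test_score']]
print(cv_table)
```

Inspecting the full table, not just the winner, helps reveal whether performance is flat or sharply peaked across the candidate values.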
As a quick check, how does this performance compare to the proportion of events in the whole data set? The output is roughly balanced so this is a model that is doing better than just random guessing!
df.y.value_counts(normalize=True)
0 0.553333 1 0.446667 Name: y, dtype: float64
The best model is available within the .best_estimator_ attribute. You can go through the widget dialog boxes to examine the individual steps within the model.
svm_poly_2_search_results.best_estimator_
Pipeline(steps=[('prepro',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('std_input',
                                                                   StandardScaler())]),
                                                  ['x1', 'x2']),
                                                 ('cat',
                                                  Pipeline(steps=[('dummy',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['x3'])])),
                ('svm', SVC(C=10, degree=2, kernel='poly', probability=True))])
The fitted GridSearchCV() object is a full model. It has the Pipelines embedded in it and thus we can provide new data and make predictions! Let's make predictions on the visualization grid with the tuned 2nd degree polynomial SVM.
input_viz_copy['pred_svm_tuneC_class'] = svm_poly_2_search_results.predict( input_viz )
input_viz_copy.head()
|   | x1 | x2 | x3 | pred_svm_highC_class | pred_svm_lowC_class | pred_svm_defaultC_class | pred_svm_tuneC_class |
|---|---|---|---|---|---|---|---|
| 0 | -3.0 | -3.00 | A | 0 | 0 | 0 | 0 |
| 1 | -3.0 | -3.00 | B | 0 | 0 | 0 | 0 |
| 2 | -3.0 | -3.00 | C | 0 | 0 | 0 | 0 |
| 3 | -3.0 | -2.94 | A | 0 | 0 | 0 | 0 |
| 4 | -3.0 | -2.94 | B | 0 | 0 | 0 | 0 |
The figure below shows the classifications for the tuned model.
sns.relplot(data = input_viz_copy, x='x1', y='x2', hue='pred_svm_tuneC_class', col='x3', s=7)
plt.show()
You might be wondering "how did you know to try the 2nd degree polynomial kernel?" I selected the second degree kernel because we previously fit logistic regression models with quadratic features! However, in practice we won't know which polynomial degree to use! The polynomial degree is itself a tuning parameter! We can therefore use GridSearchCV() to tune BOTH the C parameter and the polynomial degree, degree, at the same time!
The cell below defines a new workflow which combines the composite preprocessing action with the polynomial kernel SVM. This workflow does not specify the degree argument to SVC().
svm_poly_wflow = Pipeline( steps = [('prepro', preprocess_transform),
('svm', SVC(gamma='scale', kernel='poly', probability=True))] )
This allows us to include the svm__degree as a tuning parameter in the tuning grid. The cell below defines 3 values for the svm__degree parameter. We will try a first degree, second degree, and third degree polynomial kernel. I usually do not go above a third degree kernel with SVM. The same candidate C values are included.
svm_poly_tune_grid = {'svm__degree': [1, 2, 3],
'svm__C': [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]}
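A quick sanity check on the size of a tuning grid: sklearn's ParameterGrid enumerates all candidate combinations, so we can count the model fits the search will perform. A minimal sketch using the same grid:

```python
from sklearn.model_selection import ParameterGrid

svm_poly_tune_grid = {'svm__degree': [1, 2, 3],
                      'svm__C': [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]}

# ParameterGrid enumerates every combination of the candidate values
n_candidates = len(ParameterGrid(svm_poly_tune_grid))
print(n_candidates)      # 3 degrees * 7 C values = 21 candidates
print(n_candidates * 5)  # 21 candidates * 5 folds = 105 model fits
```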
The GridSearchCV() object is initialized below.
svm_poly_search = GridSearchCV( svm_poly_wflow, svm_poly_tune_grid, cv=my_cv)
The grid search is executed!
svm_poly_search_results = svm_poly_search.fit(x_inputs_df, df.y.to_numpy().ravel())
The best tuning parameters corresponded to the 2nd degree kernel we tried previously!
svm_poly_search_results.best_params_
{'svm__C': 10, 'svm__degree': 2}
The tuned model's average hold-out set Accuracy is thus the same as we saw previously.
svm_poly_search_results.best_score_
0.7433333333333334
gamma parameter

The SVC() function includes a gamma argument which we have assigned to the value 'scale'. The gamma argument is the kernel coefficient, which effectively scales the features inside the kernel. The documentation discusses the allowable values for the argument:
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
The default value of 'scale' calculates gamma from the variance of the features: gamma equals 1 divided by (the number of features times the variance of the entire training feature set). This is part of why it is so important to standardize all continuous inputs before fitting the SVM.
We can generate the training feature array by calling the composite preprocessor. As a reminder, the head of the composite preprocessed inputs is provided below.
preprocess_transform.fit_transform( x_inputs_df )[:6, :]
array([[-0.1679734 , -0.79704121, 0. , 0. ],
[-0.37557672, -0.383967 , 0. , 0. ],
[-0.70706077, 0.14949122, 0. , 0. ],
[-1.29334209, 1.99745207, 0. , 0. ],
[-0.42901543, 2.26361145, 0. , 0. ],
[ 0.2878375 , -1.3216246 , 0. , 0. ]])
The default value of gamma for our example is therefore approximately (the calculation below uses just the six displayed rows; the exact default uses the variance of the entire training feature set):
1 / (preprocess_transform.fit_transform( x_inputs_df )[:6, :].shape[1] * preprocess_transform.fit_transform( x_inputs_df )[:6, :].var())
0.42107705331402917
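We can verify the gamma = 1 / (number of features × feature variance) formula through the public API: an SVC fit with gamma='scale' should match an SVC fit with the manually computed value. A self-contained sketch on synthetic features (not this notebook's data):

```python
import numpy as np
from sklearn.svm import SVC

# synthetic stand-in for the preprocessed (roughly standardized) feature matrix
rng = np.random.default_rng(101)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# gamma='scale' is documented as 1 / (n_features * X.var())
manual_gamma = 1.0 / (X.shape[1] * X.var())

svc_scale = SVC(kernel='poly', degree=2, gamma='scale').fit(X, y)
svc_manual = SVC(kernel='poly', degree=2, gamma=manual_gamma).fit(X, y)

# the two fits should be numerically identical, confirming the formula
print(np.allclose(svc_scale.decision_function(X), svc_manual.decision_function(X)))
```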
Behind the scenes, gamma is calculated this way (from the full training feature set) when fitting the models. However, we could tune it! Including gamma as a tuning parameter is something you should only consider if you really want to maximize the performance of the SVM. I would recommend first tuning the polynomial degree and then tuning gamma for that tuned polynomial degree. You should tune C together with gamma.
The cell below defines a new workflow that we will use to tune C and gamma for the 2nd degree polynomial kernel.
svm_poly_2_final_wflow = Pipeline( steps = [('prepro', preprocess_transform),
('svm', SVC(kernel='poly', degree=2, probability=True))] )
The tuning grid is specified to try a range of gamma values around the default value we manually calculated above, spanning from below the default to above it. The candidate C values are the same as before.
svm_poly_2_final_tune_grid = {'svm__gamma': [0.1, 0.2, 0.4, 0.8, 1.6, 3.2],
'svm__C': [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]}
The GridSearchCV() object is initialized below.
svm_poly_2_final_search = GridSearchCV( svm_poly_2_final_wflow, svm_poly_2_final_tune_grid, cv=my_cv)
The grid search is executed!
svm_poly_2_final_search_results = svm_poly_2_final_search.fit(x_inputs_df, df.y.to_numpy().ravel())
The tuned C and gamma values are shown below. The new tuned C parameter is different from what we found previously since a different tuned value for gamma was identified!
svm_poly_2_final_search_results.best_params_
{'svm__C': 0.01, 'svm__gamma': 0.8}
This model is slightly better than the previous model that used the default gamma='scale' value.
svm_poly_2_final_search_results.best_score_
0.7633333333333334
Predictions from this final 2nd degree polynomial kernel SVM are made and visualized below.
input_viz_copy['pred_svm_2_final_class'] = svm_poly_2_final_search_results.predict( input_viz )
sns.relplot(data = input_viz_copy, x='x1', y='x2', hue='pred_svm_2_final_class', col='x3', s=7)
plt.show()
There are other kernels that can be tried. One particularly popular kernel is the radial basis function or RBF. This kernel is especially good at handling smooth relationships. It can create very flexible decision surfaces and implicitly considers many kinds of interactions between the inputs. The RBF kernel has the C tuning parameter and gamma tuning parameter just as the polynomial kernel does. However, the RBF kernel does not have the degree tuning parameter. The degree tuning parameter is specific to the polynomial kernel.
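The RBF kernel itself is simply k(x, z) = exp(-gamma × ||x − z||²). A minimal sketch comparing a manual computation against scikit-learn's rbf_kernel helper (the toy values are made up):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# small random feature matrix to evaluate the kernel on
rng = np.random.default_rng(101)
X = rng.normal(size=(5, 3))
gamma = 0.2

# manual pairwise kernel matrix: exp(-gamma * squared Euclidean distance)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_manual = np.exp(-gamma * sq_dists)

# scikit-learn's implementation of the same kernel
K_sklearn = rbf_kernel(X, gamma=gamma)

print(np.allclose(K_manual, K_sklearn))
```

Note the diagonal is all 1s: every point is maximally similar to itself, and similarity decays smoothly with distance, which is why the kernel handles smooth relationships so well.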
We could use the default gamma='scale' for the RBF kernel. If we did, the default gamma would be calculated just as we discussed for the polynomial kernel, so we know for our application that gamma would be something near 0.4. Instead of relying on the default, let's tune both gamma and C for the RBF kernel.
The SVM with the RBF kernel is initialized within the Pipeline below.
svm_rbf_wflow = Pipeline( steps = [('prepro', preprocess_transform),
('svm', SVC(kernel='rbf', probability=True))] )
The RBF kernel tuning grid is created below. An expanded set of gamma and C values are used for the RBF kernel compared to the polynomial kernel.
svm_rbf_tune_grid = {'svm__gamma': [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4],
'svm__C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5, 10, 50, 100, 500, 1000]}
The grid search is initialized below.
svm_rbf_search = GridSearchCV( svm_rbf_wflow, svm_rbf_tune_grid, cv=my_cv)
The grid search is executed!
svm_rbf_search_results = svm_rbf_search.fit(x_inputs_df, df.y.to_numpy().ravel())
The best tuning parameters are:
svm_rbf_search_results.best_params_
{'svm__C': 10, 'svm__gamma': 0.2}
The best average hold-out set Accuracy is higher than that associated with the polynomial kernels we tried previously!
svm_rbf_search_results.best_score_
0.8366666666666667
Again, let's use predictions to understand the behavior of the RBF kernel. Predictions are made below.
input_viz_copy['pred_svm_rbf_class'] = svm_rbf_search_results.predict( input_viz )
Visualizing the predicted classification reveals that the RBF kernel allowed the decision surface to change or interact with the categorical input! The EVENTS are classified in a small circular region for x3 = 'A'. The other two x3 categories are associated with EVENTS along the diagonal of the plot! Thus, the behavior associated with the continuous inputs depends on the categorical input!!!
sns.relplot(data = input_viz_copy, x='x1', y='x2', hue='pred_svm_rbf_class', col='x3', s=7)
plt.show()
The RBF kernel is particularly good at dealing with smooth behavior, especially when interactions are present.
The best SVM corresponded to the RBF kernel. It was better than the other polynomial kernels because it allowed for interactions between the categorical and continuous inputs. However, we know how to generate interactions. Could we not have trained a regularized logistic regression model that captured this same kind of behavior?
Let's find out!
Let's create a composite Pipeline that interacts the categorical variable with main effects, interactions, and quadratic features associated with the continuous inputs.
The preprocess_transform object creates the dummies and standardizes the continuous inputs. However, let's define a new numeric pipeline which standardizes, generates the polynomial features, and then standardizes those polynomial features again. The include_bias argument is set to False so that the "intercept column of 1s", or BIAS column, is not included.
numeric_poly_transform = Pipeline( steps = [('std_input', StandardScaler()),
('make_poly', PolynomialFeatures(degree=2, include_bias=False)),
('std_poly', StandardScaler() )] )
Let's now define a composite transform which combines the dummy variable creation and the polynomial feature generation.
prepro_poly_transform = ColumnTransformer( transformers=[ ('num', numeric_poly_transform, names_numeric_inputs),
('cat', categorical_transform, names_cat_inputs) ] )
Interacting the categorical input with all continuous features requires generating polynomial features! The cell below defines one more Pipeline which generates all main effects and pairwise products.
prepro_all_pairs = Pipeline( steps=[ ('prepro', prepro_poly_transform),
('make_pairs', PolynomialFeatures(degree=2, include_bias=False, interaction_only=True))])
Let's check the number of features associated with this model! As shown below, there are 28 columns or features created!
prepro_all_pairs.fit_transform( x_inputs_df ).shape
(300, 28)
Is this correct? Let's use dmatrices() to check! The dmatrices() function is imported below.
from patsy import dmatrices
The formula interface is used to interact the categorical input with the continuous input main effects, interactions, and quadratic features. The return_type argument is set to dataframe to provide the design matrix as a Pandas dataframe rather than a NumPy array.
y_check, X_check = dmatrices('y ~ x3 * ( x1 * x2 + np.power(x1, 2) + np.power(x2, 2) )', data = df, return_type='dataframe')
How many columns are present in the dmatrices() created design matrix? As shown below, there are only 18 columns present! (1 intercept + 2 dummies + 5 continuous features + 2 × 5 dummy-by-continuous interactions = 18.)
X_check.shape
(300, 18)
The formula interface was used to generate the features in dmatrices(). The INTERCEPT column of 1s, or BIAS column, is included by default, and the categorical input is converted to dummies. The column names associated with the returned dataframe are displayed below. The dummies are interacted with all continuous features.
X_check.columns
Index(['Intercept', 'x3[T.B]', 'x3[T.C]', 'x1', 'x3[T.B]:x1', 'x3[T.C]:x1',
'x2', 'x3[T.B]:x2', 'x3[T.C]:x2', 'x1:x2', 'x3[T.B]:x1:x2',
'x3[T.C]:x1:x2', 'np.power(x1, 2)', 'x3[T.B]:np.power(x1, 2)',
'x3[T.C]:np.power(x1, 2)', 'np.power(x2, 2)', 'x3[T.B]:np.power(x2, 2)',
'x3[T.C]:np.power(x2, 2)'],
dtype='object')
So why did our Pipelines result not match the dmatrices() result? First, dmatrices() includes the intercept while our Pipelines do not. But that's just one column! That does not explain the difference!
Let's check the number of unique values for all columns in the Pipelines version. Columns 5 and 6 each have 2 unique values. Those are the two dummy variable columns! Column 27 has 1 unique value!
pd.DataFrame(prepro_all_pairs.fit_transform( x_inputs_df )).nunique()
0 300 1 300 2 300 3 300 4 300 5 2 6 2 7 300 8 300 9 300 10 300 11 101 12 101 13 300 14 300 15 300 16 101 17 101 18 300 19 300 20 101 21 101 22 300 23 101 24 101 25 101 26 101 27 1 dtype: int64
As shown below, the last column has 1 unique value equal to 0. This column corresponds to the product of the two dummy variables. We know that because at most 1 of the dummy columns can equal 1 for any observation. Thus, the product of the dummies is always zero.
pd.DataFrame(prepro_all_pairs.fit_transform( x_inputs_df )).iloc[:, 27].value_counts()
0.0 300 Name: 27, dtype: int64
The last column is thus a constant value and cannot be used to learn any relationships. The last column should therefore be removed!
However, why are there so many additional columns in our Pipelines? We included PolynomialFeatures() with degree=2 and interaction_only=True. This step interacts all columns of its input design matrix, so all continuous features are interacted with each other! We are creating products involving the quadratic terms and products involving the interaction terms. We are including more products than we originally intended!
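We can confirm the column count: the final PolynomialFeatures() step receives 7 columns (the 5 polynomial features of x1 and x2 plus the 2 dummies), and with degree=2 and interaction_only=True it returns the 7 main effects plus all 21 pairwise products. A minimal sketch with random stand-in columns:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 7 stand-in columns: the 5 polynomial features of (x1, x2) plus the 2 dummies
X7 = np.random.default_rng(101).normal(size=(10, 7))

make_pairs = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
n_out = make_pairs.fit_transform(X7).shape[1]

# 7 main effects + 21 pairwise products (7 choose 2) = 28 features
print(n_out)
```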
What if we created the polynomial features from the dummies combined with the numeric inputs? Let's try that now.
prepro_make_pairs = Pipeline( steps = [('prepro', preprocess_transform),
('make_pairs', PolynomialFeatures(degree=2, include_bias=False))])
How many features are present now? As shown below, there are 14 features! We removed the multiplication of the "higher order" features, but now we're missing some features!
prepro_make_pairs.fit_transform( x_inputs_df ).shape
(300, 14)
The number of unique values per feature are shown below. We again see the dummy variables have 2 unique values. And there is a column with 1 unique value due to the product of the dummies.
pd.DataFrame(prepro_make_pairs.fit_transform( x_inputs_df )).nunique()
0 300 1 300 2 2 3 2 4 300 5 300 6 101 7 101 8 300 9 101 10 101 11 2 12 1 13 2 dtype: int64
We are missing the features which interact the higher degree polynomials with the dummies. We know this because the quadratic features are generated in the same degree-2 step as the interactions, so they cannot themselves be interacted with the dummies.
Essentially, Pipelines are helpful for creating interactions, but it is challenging to exactly replicate some of the custom formulas we can easily type in statsmodels. It is therefore especially critical to use regularization to manage the extra features generated when we allow all continuous features generated from Pipelines to vary across all categories.
Thus, let's use the all-pairs preprocessing, which includes more features than we originally intended. We will tune Elastic Net to try to turn off unimportant features. We will tune both C and the mixing fraction between ridge and lasso. The elastic net model is initialized below.
enet_to_fit = LogisticRegression(penalty='elasticnet', solver='saga', random_state=101, max_iter=25001, fit_intercept=True)
The elastic net tuning grid is defined below with a relatively small number of C values.
enet_grid = {'enet__C': np.exp(np.linspace(-6, 6, num=11)),
'enet__l1_ratio': np.linspace(0, 1, num=5)}
The workflow for the elastic net model applied to all pairs of features is defined below.
enet_pairs_wflow = Pipeline( steps = [('prepro', prepro_all_pairs),
('enet', enet_to_fit)] )
The grid search is initialized.
enet_pairs_search = GridSearchCV( enet_pairs_wflow, enet_grid, cv=my_cv)
The grid search is executed!
enet_pairs_search_results = enet_pairs_search.fit(x_inputs_df, df.y.to_numpy().ravel())
The tuned Elastic Net turned into Lasso!
enet_pairs_search_results.best_params_
{'enet__C': 1.0, 'enet__l1_ratio': 1.0}
The best Elastic Net model has an averaged hold-out set Accuracy of about 85%!
enet_pairs_search_results.best_score_
0.8533333333333333
How many coefficients have been forced to zero?
enet_pairs_search_results.best_estimator_.named_steps['enet'].coef_
array([[ 0. , 0. , -2.51251861, 0. , -2.493116 ,
0. , 0. , 0. , 0. , 0. ,
0.21784521, -0.28300923, 0.20681073, 0. , 0.09812563,
0.23528396, 0.25879634, -0.09496183, 0. , 0.30480021,
0. , 0. , 0.26892086, 2.43019674, -3.23230265,
0.01559593, 0.94699371, 0. ]])
There are 15 non-zero coefficients!!!!
enet_pairs_search_results.best_estimator_.named_steps['enet'].coef_[enet_pairs_search_results.best_estimator_.named_steps['enet'].coef_ != 0].size
15
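The exact zeros come from the L1 part of the penalty: with l1_ratio=1 the elastic net is a pure lasso and prunes uninformative features. A self-contained sketch on synthetic data with noise features (the toy problem is an assumption, not this notebook's data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy problem: 3 informative features, the remaining 17 are noise
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=101)

# l1_ratio=1 makes the elastic net a pure lasso; small C means a strong penalty
lasso_like = LogisticRegression(penalty='elasticnet', solver='saga',
                                l1_ratio=1.0, C=0.1, max_iter=5000)
lasso_like.fit(X, y)

# the L1 penalty drives coefficients on uninformative features to exactly zero
n_zero = int(np.sum(lasso_like.coef_ == 0))
print(n_zero, 'of', lasso_like.coef_.size, 'coefficients are exactly zero')
```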
Lastly, let's predict with this model to interpret its behavior.
input_viz_copy['pred_enet_pairs_class'] = enet_pairs_search_results.predict( input_viz )
The elastic net model has a similar set of predicted classifications as the tuned SVM RBF. The decision surface depends on the categorical input!
sns.relplot(data = input_viz_copy, x='x1', y='x2', hue='pred_enet_pairs_class', col='x3', s=7)
plt.show()
We fit multiple types of binary classification models in this notebook. The primary intent was to introduce the key arguments you should consider when working with Support Vector Machine (SVM) classifiers. However, let's compile the cross-validation results associated with the best tuned models and compare across all grid search results.
The function defined below will help compile all results.
def extract_model_performance(a_mod, a_name, num_folds):
res = pd.DataFrame({'mean_test_score': a_mod.cv_results_['mean_test_score'],
'std_test_score': a_mod.cv_results_['std_test_score']})
res['mean_test_score_se'] = res.std_test_score / np.sqrt( num_folds )
res['model_name'] = a_name
res.drop(['std_test_score'], axis=1, inplace=True)
res_best = res.loc[ res.mean_test_score == res.mean_test_score.max(), :].copy()
return res_best
Let's extract the best performance associated with each of the grid search results.
all_compare = pd.concat([extract_model_performance(svm_poly_search_results, 'SVM Poly', my_cv.get_n_splits()),
extract_model_performance(svm_poly_2_final_search_results, 'SVM Tuned degree 2', my_cv.get_n_splits()),
extract_model_performance(svm_rbf_search_results, 'SVM RBF', my_cv.get_n_splits()),
extract_model_performance(enet_pairs_search_results, 'ENET All Pairs', my_cv.get_n_splits())])
The Elastic Net model with the large number of interactions is the overall best model! We generated all features associated with this model and so it is easier to interpret compared to the SVM RBF which generates the strange radial basis function features for us. This highlights that we should always consider generating the features ourselves with simpler methods! More advanced methods may not always work out better!
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(15, 8))
ax.errorbar(y = all_compare.model_name,
x = all_compare.mean_test_score,
xerr = all_compare.mean_test_score_se,
fmt = 'o', color = 'black', ecolor = 'black',
elinewidth=3, ms=10)
ax.set_xlabel('Cross-validation average hold-out set score')
ax.set_ylabel('Model')
plt.show()